Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uncaught socket exception during timeout handling #310

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ajyoung
Copy link

@ajyoung ajyoung commented Aug 4, 2020

❓ What kind of change does this PR introduce?

  • 🐞 bug fix
  • 🐣 feature
  • πŸ“‹ docs update
  • πŸ“‹ tests/coverage improvement
  • πŸ“‹ refactoring
  • πŸ’₯ other

πŸ“‹ What is the related issue number (starting with #)

Fixes #210

❓ What is the current behavior? (You can also link to an open issue here)

When connections with no content that last for longer than the connection timeout are made to a cheroot server backed by pyOpenSSL, the connection timeout on the server side also causes an unhandled socket error that causes the thread to be dropped. Eventually, the server will extinguish its thread pool and become unresponsive.

See #210 (comment)

❓ What is the new behavior (if this is a feature change)?

Threads are not dropped when connections with no content that last longer than the connection timeout are made to the server.

πŸ“‹ Other information:

πŸ“‹ Checklist:

  • I think the code is well written
  • I wrote good commit messages
  • I have squashed related commits together after the changes have been approved
  • Unit tests for the changes exist
  • Integration tests for the changes exist (if applicable)
  • I used the same coding conventions as the rest of the project
  • The new code doesn't generate linter offenses
  • Documentation reflects the changes
  • The PR relates to only one subject with a clear title
    and description in grammatically correct, complete sentences

This change is Reviewable

@ajyoung
Copy link
Author

ajyoung commented Aug 5, 2020

The windows test suite failure has been happening on master for the last ten days as well. But I'm not sure yet about the ci/circleci: macos-build and continuous-integration/appveyor/pr failures.

@webknjaz
Copy link
Member

@ajyoung there's a few TLS tests that are flaky on slow machines like macOS and Windows. If they happen on master, ignore those.

Copy link
Member

@webknjaz webknjaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajyoung thanks for the PR! The fix looks correct but this patch also needs a regression test so that it's checked in the CI and is not only confirmed working in your specific env.

@@ -1337,6 +1337,8 @@ def _conditional_error(self, req, response):

try:
req.simple_response(response)
except socket.error:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would need a long comment explaining why this is necessary.

@stale
Copy link

stale bot commented Oct 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale This thing has been ignored for too long label Oct 12, 2020
@webknjaz webknjaz removed the stale This thing has been ignored for too long label Oct 19, 2020
@webknjaz
Copy link
Member

webknjaz commented Nov 2, 2020

@ajyoung any updates?

@ajyoung
Copy link
Author

ajyoung commented Nov 9, 2020

@webknjaz -- Sorry for the delay. I haven't forgotten about it but I need to the get the environment setup again so I can run and write the required tests. I'll try to get back to this before Thanksgiving in the US (November 27)

@ajyoung ajyoung force-pushed the 210 branch 2 times, most recently from 0500eb0 to d2fde4e Compare November 12, 2020 19:06
@ajyoung
Copy link
Author

ajyoung commented Nov 12, 2020

@webknjaz -- It looks like the test_wsgi tests will no longer run under python 2.7 due to warnings like "CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release"

Do you know a way to suppress this warning when running the tests under python 2.7?

@webknjaz
Copy link
Member

It's already suppressed on master and doesn't prevent anything from being executed in this PR. You may be looking at some old CI runs. Your test fails because you attempt using socket as a CM while it doesn't have such interface.

@ajyoung
Copy link
Author

ajyoung commented Nov 13, 2020

@webknjaz -- I fixed the python 2.7 syntax. The tests pass here on linux but one concern is that I tried the test on the master branch without the fix and the test passes there as well. It seems that either the attempt to set the thread count to 1 or the attempt to set the timeout to 1s is incorrect in the test (so the server isn't actually dropping all threads with just two client calls).

I can verify that when I run the master branch manually and the 210 branch manually, that I can reproduce the issue with the same python client code. I run the code below eleven times against the server running on the master branch and it drops all threads. But that doesn't seem to be happening in the automated test.

[root@host-4 ~]# cat sslperftest-simple-client-py2.py
#!/usr/bin/python2

import socket
import time

socket_type = socket.AF_INET

s = socket.socket(socket_type, socket.SOCK_STREAM)
s.connect(('host-4.example.com', 11827))
time.sleep(12)
s.close()

@webknjaz
Copy link
Member

@ajyoung that looks like a half of a reproducer, it'd be useful to see the exact server part of the repro too.

@ajyoung
Copy link
Author

ajyoung commented Nov 14, 2020

@webknjaz -- The server reproduction is here.

[root@host-4 ~]vi sslperftest-simple-pyopenssl.py py

#!/usr/bin/python2
certfile='/tmp/ssl-cert.pem'
keyfile='/tmp/ssl-cert-pk.pem'
cafile=None
ciphers=None

host='host-4.example.com'
port=11827

def raw_wsgi_app(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type','text/plain')]
    start_response(status, response_headers)
    return ['Hello world!']

from cheroot.ssl.pyopenssl import pyOpenSSLAdapter
from cheroot import wsgi

bind_addr = (host, port)
server = wsgi.Server(bind_addr, raw_wsgi_app, request_queue_size=32)
server.ssl_adapter = pyOpenSSLAdapter(certfile, keyfile, certificate_chain=cafile, ciphers=ciphers)

try:
    server.start()
except KeyboardInterrupt:
    server.stop()

@@ -1343,6 +1343,11 @@ def _conditional_error(self, req, response):

try:
req.simple_response(response)
except socket.error:
Copy link
Member

@webknjaz webknjaz Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajyoung After giving it some thought, I think that this a wrong fix. It relies on the leaking implementation detail.

Instead, we need to change the pyopenssl adapter code to raise a standardized exception defined in https://github.com/cherrypy/cheroot/blob/master/cheroot/errors.py, like in other except-blocks. This would abstract us from the underlying causes. The exception should probably inherit https://docs.python.org/3/library/exceptions.html#ConnectionError.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@webknjaz -- I suppose in that case, we'd want to just add the errnum to errors.socket_errors_to_ignore and not throw an exception at all. But in this case, the errnum is -1. It cannot be translated to a specific error. Do we want to handle all -1 errors this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR We need to have our own exception declared for communicating this.

-1 is not at error code, it's just how OpenSSL signals that there's something wrong AFAIU. IIUC OpenSSL is implemented as a state machine and to figure out what error state it is in, one should call SSL_get_error(). For example, here's what PyOpenSSL does: https://github.com/pyca/pyopenssl/blob/52341e8/src/OpenSSL/SSL.py#L1530.

Not throwing an exception is a no-go. It'll probably break some of the mechanisms. Also, exceptions is how we communicate problems while returning some values mean that there's no problems at all which is not at all true.

Copy link
Author

@ajyoung ajyoung Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@webknjaz -- I've looked at this more closely now. At the point highlighted above where the socket.error occurs, it is already clear that the underlying error is an SSL.SysCallError. At the point of execution highlighted in server.py above, the ssl context isn't available but the stack trace makes this clear:

[root@host-4 ~]# /opt/cloudera/parcels/KEYTRUSTEE_SERVER/bin/python sslperftest-simple-pyopenssl.py
Exception in thread CP Server Thread-1:
Traceback (most recent call last):
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/workers/threadpool.py", line 125, in run
    keep_conn_open = conn.communicate()
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/server.py", line 1290, in communicate
    self._conditional_error(req, '408 Request Timeout')
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/server.py", line 1339, in _conditional_error
    req.simple_response(response)
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/server.py", line 1111, in simple_response
    self.conn.wfile.write(EMPTY.join(buf))
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/makefile.py", line 78, in write
    data_mv[bytes_sent:bytes_sent + SOCK_WRITE_BLOCKSIZE],
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/ssl/pyopenssl.py", line 163, in send
    *args, **kwargs
  File "/opt/cloudera/parcels/KEYTRUSTEE_SERVER/lib/python2.7/site-packages/cheroot/ssl/pyopenssl.py", line 110, in _safe_call
    raise socket.error(errnum)
error: -1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we need to raise a better error there. How about FatalSSLAlert?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a new commit that makes the fix by raising FatalSSLAlert in cheroot/ssl/pyopenssl.py instead of ignoring the socket.error in cheroot/server.py.

@webknjaz
Copy link
Member

@ajyoung looks like this failing test is related to the changes in this PR: https://github.com/cherrypy/cheroot/pull/310/checks?check_run_id=1397448854#step:14:682

Comment on lines 795 to 796
tlshttpserver = \
tls_single_thread_http_server((interface, port), tls_adapter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Escaping EOL is harmful, wrap with parentheses instead

Suggested change
tlshttpserver = \
tls_single_thread_http_server((interface, port), tls_adapter)
tlshttpserver = (
tls_single_thread_http_server((interface, port), tls_adapter)
)

s = socket.socket(socket_type, socket.SOCK_STREAM)
s.connect((ip_addr, port))
time.sleep(3)
s.close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe close this socket at the end of the test function?

pytest.param(ANY_INTERFACE_IPV6, marks=missing_ipv6),
),
)
def test_server_timeout_before_content_error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't have any asserts: I'd expect it to use something like capfd fixture and check what server outputs.

@ajyoung
Copy link
Author

ajyoung commented Dec 15, 2020

Just an update on this -- I will get back to it after Christmas, just before the new year.

@webknjaz webknjaz force-pushed the 210 branch 2 times, most recently from 158239e to ddee97e Compare October 2, 2022 02:36
webknjaz added a commit that referenced this pull request Mar 31, 2024
A DoS would happen in many situations, including TLS errors and
attempts to close the underlying sockets erroring out.

This patch aims to prevent a situation when the worker threads are
killed by arbitrary exceptions that bubble up to their entry point
layers that aren't handled properly or at all.

PR #649

Fixes #358
Fixes #354

Ref #310
Ref #346
Ref #375
Ref #599
Ref #641

Resolves #365
Copy link

codecov bot commented Apr 1, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 83.56%. Comparing base (4ff92c4) to head (7ce37c1).
Report is 5 commits behind head on main.

❗ Current head 7ce37c1 differs from pull request most recent head e9330ac. Consider uploading reports for the commit e9330ac to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
+ Coverage   79.83%   83.56%   +3.72%     
==========================================
  Files          28       28              
  Lines        4112     4174      +62     
==========================================
+ Hits         3283     3488     +205     
+ Misses        829      686     -143     

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Uncaught socket exception during timeout handling.
2 participants